
    Measuring Harmful Representations in Scandinavian Language Models

    Scandinavian countries are perceived as role models when it comes to gender equality. With the advent of pre-trained language models and their widespread usage, we investigate to what extent gender-based harmful and toxic content exists in selected Scandinavian language models. We examine nine models, covering Danish, Swedish, and Norwegian, by manually creating template-based sentences and probing the models for completion. We evaluate the completions using two methods for measuring harmful and toxic completions and provide a thorough analysis of the results. We show that Scandinavian pre-trained language models contain harmful and gender-based stereotypes, with similar values across all languages. This finding goes against the general expectations related to gender equality in Scandinavian countries and shows the possible problematic outcomes of using such models in real-world settings. Comment: Accepted at the 5th Workshop on Natural Language Processing and Computational Social Science (NLP+CSS) at EMNLP 2022 in Abu Dhabi, Dec 7, 2022.
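
    The template-based probing described above can be approximated with an off-the-shelf fill-mask pipeline. A minimal sketch follows, assuming a HuggingFace masked language model; the model name, templates, and top-k value are illustrative assumptions rather than the paper's actual choices, and the toxicity scoring step is only indicated in a comment.

```python
# Minimal sketch of template-based probing of a masked language model.
# Model name and templates are illustrative assumptions, not the paper's choices.
from transformers import pipeline

fill = pipeline("fill-mask", model="NbAiLab/nb-bert-base")  # assumed Norwegian model
mask = fill.tokenizer.mask_token

templates = [f"Kvinner er {mask}.", f"Menn er {mask}."]  # "Women/Men are [MASK]."

for template in templates:
    for candidate in fill(template, top_k=5):
        # Each completion would then be scored with a toxicity/harm classifier.
        print(template, "->", candidate["token_str"], round(candidate["score"], 3))
```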

    ADIOS LDA: When Grammar Induction Meets Topic Modeling

    We explore the interplay between grammar induction and topic modeling approaches to unsupervised text processing. These two methods complement each other: one identifies local structures centered around certain key terms, while the other generates a document-wide context of expressed topics. This approach allows us to access and identify semantic structures that would otherwise hardly be discovered by using only one of the two methods. Using our approach, we are able to provide a deeper understanding of the topic structure by examining inferred information structures characteristic of given topics, as well as to capture differences in word usage that would be hard to detect with standard disambiguation methods. We perform our exploration on an extensive corpus of blog posts centered around the surveillance discussion, where we focus on the debate around the Snowden affair. We show how our approach can be used for (semi-)automated content classification and the extraction of semantic features from large textual corpora.
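
    The topic-modeling half of this pipeline can be sketched with gensim's LDA; grammar induction (ADIOS) has no standard library implementation, so the sketch below assumes pre-tokenized documents with toy contents that merely stand in for the blog corpus.

```python
# Minimal LDA sketch with gensim; the documents are toy placeholders.
from gensim import corpora, models

docs = [
    ["surveillance", "privacy", "snowden", "leak"],
    ["government", "data", "collection", "privacy"],
    ["blog", "debate", "snowden", "whistleblower"],
]

dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

# Infer a small number of topics; a real corpus would use far more documents.
lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary,
                      passes=10, random_state=0)

for topic_id, words in lda.show_topics(num_topics=2, num_words=4, formatted=False):
    print(topic_id, [word for word, _ in words])
```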

    Constructions: a new unit of analysis for corpus-based discourse analysis

    We propose and assess the novel idea of using automatically induced constructions as a unit of analysis for corpus-based discourse analysis. Automated techniques are needed in order to elucidate important characteristics of corpora for social science research into topics, framing and argument structures. Compared with current techniques (keywords, n-grams, and collocations), constructions capture more linguistic patterning, including some grammatical phenomena. Recent advances in natural language processing mean that it is now feasible to automatically induce some constructions from large unannotated corpora. In order to assess how well constructions characterise the content of a corpus and how well they elucidate interesting aspects of different discourses, we analysed a corpus of climate change blogs. The utility of constructions for corpus-based discourse analysis was compared qualitatively with keywords, n-grams and collocations. We found that the unusually frequent constructions gave interesting and different insights into the content of the discourses and enabled better comparison of sub-corpora.
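
    For comparison, the baseline units mentioned above (n-grams and collocations) are straightforward to extract with NLTK. A minimal sketch on a toy token list, not the climate change blog corpus itself:

```python
# Minimal sketch: n-grams and PMI-ranked collocations as baseline units of analysis.
from nltk import ngrams
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = ("climate change is real and climate change policy matters "
          "and climate policy debate continues").split()

# Raw bigrams as one baseline unit.
print(list(ngrams(tokens, 2))[:5])

# Collocations ranked by pointwise mutual information as another baseline.
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # keep only pairs seen at least twice
print(finder.nbest(BigramAssocMeasures.pmi, 5))
```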

    Learning Horn Envelopes via Queries from Large Language Models

    We investigate an approach for extracting knowledge from trained neural networks based on Angluin's exact learning model with membership and equivalence queries to an oracle. In this approach, the oracle is a trained neural network. We consider Angluin's classical algorithm for learning Horn theories and study the changes necessary to make it applicable to learning from neural networks. In particular, we have to account for the fact that trained neural networks may not behave as Horn oracles, meaning that their underlying target theory may not be Horn. We propose a new algorithm that aims at extracting the "tightest Horn approximation" of the target theory and that is guaranteed to terminate in exponential time (in the worst case), and in polynomial time if the target has polynomially many non-Horn examples. To showcase the applicability of the approach, we perform experiments on pre-trained language models and extract rules that expose occupation-based gender biases. Comment: 35 pages, 2 figures; manuscript accepted for publication in the International Journal of Approximate Reasoning (IJAR).
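
    The exact-learning setup can be illustrated with a toy oracle interface. The sketch below is a hypothetical illustration of membership and equivalence queries over boolean assignments, with a hand-written function standing in for the trained network; it is not the paper's algorithm for extracting the tightest Horn approximation.

```python
# Toy sketch of Angluin-style membership/equivalence queries.
# The oracle is a hand-written boolean function standing in for a trained
# neural network; variable names and behaviour are illustrative assumptions.
from itertools import product


def oracle(assignment):
    """Membership query: does the 'network' accept this assignment?"""
    is_doctor, is_female = assignment  # toy stand-ins for occupation/gender attributes
    return not (is_doctor and is_female)  # deliberately biased toy behaviour


def equivalence(hypothesis, n_vars=2):
    """Equivalence query: return a counterexample assignment, or None if equivalent."""
    for assignment in product([False, True], repeat=n_vars):
        if hypothesis(assignment) != oracle(assignment):
            return assignment
    return None


always_true = lambda assignment: True  # candidate hypothesis: accept everything
print(equivalence(always_true))        # -> (True, True), the rejected combination
```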

    Identifying Token-Level Dialectal Features in Social Media

    Dialectal variation is present in many human languages and is attracting a growing interest in NLP. Most previous work concentrated on either (1) classifying dialectal varieties at the document or sentence level or (2) performing standard NLP tasks on dialectal data. In this paper, we propose the novel task of token-level dialectal feature prediction. We present a set of fine-grained annotation guidelines for Norwegian dialects, expand a corpus of dialectal tweets, and manually annotate them using the introduced guidelines. Furthermore, to evaluate the learnability of our task, we conduct labeling experiments using a collection of baselines, weakly supervised, and supervised sequence labeling models. The obtained results show that, despite the difficulty of the task and the scarcity of training data, many dialectal features can be predicted with reasonably high accuracy.
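
    Token-level dialectal feature prediction is a sequence labeling task, and the weakly supervised end of the baseline spectrum can be sketched as simple lexicon lookup over tokens. The lexicon entries and tag names below are illustrative assumptions, not the paper's annotation guidelines.

```python
# Minimal lexicon-lookup baseline for token-level dialect feature tagging.
# Entries and tag names are illustrative, not the actual annotation scheme.
DIALECT_LEXICON = {
    "e": "PRON_1SG",   # dialectal "I" (vs. Bokmål "jeg")
    "itj": "NEG",      # dialectal negation (vs. "ikke")
    "ka": "WH",        # dialectal "what" (vs. "hva")
}


def tag_tokens(tokens):
    """Assign one dialect-feature tag per token, 'O' when no feature applies."""
    return [(token, DIALECT_LEXICON.get(token.lower(), "O")) for token in tokens]


tweet = "ka e det du itj forstår".split()
print(tag_tokens(tweet))
```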

    Making sense of nonsense: Integrated gradient-based input reduction to improve recall for check-worthy claim detection

    Analysing long text documents of political discourse to identify check-worthy claims (claim detection) is known to be an important task in automated fact-checking systems, as it saves the precious time of fact-checkers and allows for more fact-checks. However, existing methods use black-box deep neural NLP models to detect check-worthy claims, which limits the understanding of the models and the mistakes they make. The aim of this study is therefore to leverage an explainable neural NLP method to improve the claim detection task. Specifically, we exploit well-known integrated gradient-based input reduction on textCNN and BiLSTM to create two different reduced claim data sets from ClaimBuster. We observe that a higher recall in check-worthy claim detection is achieved on the data reduced by BiLSTM compared to models trained on the original claims. This is an important remark, since the cost of overlooking check-worthy claims is high in claim detection for fact-checking. This also holds when a pre-trained BERT sequence classification model is fine-tuned on the reduced data set. We argue that removing superfluous tokens using explainable NLP could unlock the true potential of neural language models for claim detection, even though the reduced claims might make no sense to humans. Our findings provide insights on task formulation, design of annotation schema, and data set preparation for check-worthy claim detection.
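
    The input reduction step itself can be sketched independently of the attribution model: given per-token attribution scores, keep only the highest-scoring tokens in their original order. The scores below are hard-coded assumptions; in the study they would come from integrated gradients over the trained textCNN or BiLSTM claim detector.

```python
# Minimal sketch of attribution-based input reduction.
# The scores are placeholders standing in for integrated-gradients attributions.
def reduce_claim(tokens, scores, keep_ratio=0.5):
    """Keep the top-scoring fraction of tokens, preserving original order."""
    n_keep = max(1, int(len(tokens) * keep_ratio))
    top_idx = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:n_keep]
    return [tokens[i] for i in sorted(top_idx)]


tokens = ["the", "senator", "claimed", "unemployment", "fell", "by", "ten", "percent"]
scores = [0.01, 0.40, 0.55, 0.80, 0.60, 0.02, 0.70, 0.65]  # assumed attributions

print(reduce_claim(tokens, scores))
# -> ['unemployment', 'fell', 'ten', 'percent']: nonsensical to a human reader,
#    but it keeps the tokens the model relied on.
```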